ABSTRACT


In skeleton-based action recognition, mining information from the joints and limbs of human
skeletons plays a key role. Previous studies either treated the skeleton data as vectors and could
not explicitly capture the joint interactions (e.g., RNN-based methods), or modeled the joint
interactions only locally and may lose important cues without global response mapping (e.g., CNN
and GCN (Graph Convolution Network) based methods). In this work, we address these problems by
considering the potential relations of all the node pairs and edge pairs on the skeleton graphs.
A dilation group-specific convolution module is proposed to aggregate relation messages of all the
unit pairs on the skeleton graphs. By enumerating all the pair relations, the joint interactions
can be learned explicitly and globally. The module is then enhanced by attention pooling,
including temporal attention, spatial attention and channel attention. By stacking several such
blocks, the relation messages of the node pairs are augmented by multi-layer propagation. Finally,
the late fusion of four streams is used to combine the predictions of different inputs including
node pairs, edge pairs and corresponding frame differences. The proposed method, termed
conv-relation network, achieves competitive performance on two large scale datasets, NTU RGB+D
and Kinetics.


1. Introduction 


Recent years have witnessed deep learning [1] being widely popular in the vision community, e.g.,
image classification [1], object detection [2], video classification [3], pose recovery [4],
stereo matching [5], image ranking [6] and visual question answering [7]. Particularly, human
action recognition has received increasing attention due to potential applications in human-robot
interaction, behavior analysis and surveillance. According to the types of input data, human
action recognition can be categorized into RGB-based [3,8-12] and skeleton-based approaches
[13-21]. Compared with RGB images, skeleton data has the merits of being lightweight and robust
against background noise.

In this paper, we focus on the problem of skeleton-based human action recognition. The
interactions of skeleton joints play a key role in characterizing an action. Traditional methods
[13-15] design hand-crafted features to extract co-occurrence patterns from skeleton sequences.
With the resurgence of neural networks, Recurrent Neural Networks (RNN) and Convolutional Neural
Networks (CNN) have been widely used in skeleton-based action recognition [16-20]. The RNN-based
methods [16-18] transform the skeleton data into a vector of joint coordinates and then capture
the sequence information of the skeleton. Compared with RNN, CNN parallelizes well and can benefit
from pretraining on large scale datasets. CNN-based methods [19,20] primarily represent a skeleton
sequence as a pseudo-image and recognize the underlying action in the same way as image
classification. However, local convolution cannot learn the global joint interactions efficiently,
and the underlying assumption that joints are spatially adjacent in the input tensor may introduce
an unreliable prior. Recently, graph convolution network (GCN) based methods [21] have been used
to capture joint interactions on the skeleton graphs, explicitly considering the adjacency
relationship between joints in a non-Euclidean space. Nevertheless, the skeleton graphs and their
manually designed convolution kernels also limit joint interactions to being learned in a local
manner. This motivates us to design a model which can break the limit of local convolution and
learn the joint interactions explicitly and globally.

In this paper, this problem is solved by considering the potential relations of all the node pairs
and edge pairs on the skeleton graphs. Firstly, a dilation group-specific convolution module is
proposed to compute the interactions of all the node pairs on the skeleton graphs. By using this
module, convolutions can be explicitly performed on all the node pairs. In this way, joint
interactions are completely captured by pair interactions in a dense convolutional manner.
Secondly, the attention pooling operations, including temporal attention, spatial attention and
channel attention, are used to enhance the module. Finally, the late fusion of four streams is
used to combine the predictions of different inputs including node pairs, edge pairs and
corresponding frame differences. The proposed method, termed conv-relation network, is shown in
Fig. 1.

Fig. 1. Convolutional relation network for skeleton-based action recognition. The relations of
node pairs, edge pairs and corresponding frame differences on the skeleton graphs are explicitly
modeled respectively. Due to the space limit, we only show one building block of our model in the
top, middle figure. The details of this block are given in Fig. 2. The late fusion of four streams
is used to get the final prediction.

The main contributions of this work can be summarized as fol- 
lows: 


• A dilation group-specific convolution module is proposed for aggregating dense relations of all
  the node pairs of the skeleton graphs. Besides, three attention models (i.e., temporal, spatial
  and channel attention) are proposed to enhance this convolution module.

• Different inputs on the skeleton graphs, including node pairs, edge pairs and their
  corresponding frame differences, are combined by the late fusion to improve the effectiveness.

• On two large scale datasets for skeleton-based action recognition, the proposed conv-relation
  network achieves competitive performance.


2. Related work 
2.1. Skeleton-based action recognition 


Hand-crafted features. Traditional methods designed handcrafted 
features to capture the dynamics of joint motion. These could be 
actionlet ensemble [13], covariance matrices of joint trajectories 
[14], rotations and translations between body parts [15], or tem- 
poral order information [22]. 

Deep learning methods. With the resurgence of neural networks, long short-term memory (LSTM)
networks have been adopted for feature learning from skeleton sequences due to their ability to
model long-term temporal dependency. In [17], a spatial-temporal LSTM model based on a gating
mechanism is proposed to filter out unreliable input caused by occlusion and sensor noise. [18]
proposed an end-to-end fully connected deep LSTM network to learn co-occurrence features from
skeleton data. In [23], a view-adaptive LSTM was introduced to adaptively transform the skeleton
data to a suitable view for better action recognition. In recent years, a growing body of work has
adopted CNN for skeleton feature learning, due to the benefits of transferring knowledge from
large scale image datasets. For example, [19] quantified the skeleton sequences into images and
then fed them into a CNN. In [24], a view-adaptive CNN was introduced to deal with the view
variation challenge. Because of the operation of local convolution, CNN-based methods cannot
capture the global relationship of all joints efficiently. Moreover, assuming the joints to be
adjacent in the Euclidean space may introduce an unreliable prior. Human skeletons in videos can
be treated as a spatial-temporal graph. Recently, graph convolution neural networks have been used
to learn the spatial parts of human skeletons [21], explicitly considering the adjacency
relationship between joints in a non-Euclidean space. Nevertheless, the skeleton graphs and their
manually designed convolution kernels also limit joint interactions to being learned in a local
manner.


2.2. Video-based action recognition 


One important stream of action recognition is video-based action recognition, where methods based
on deep CNNs have dominated. Two-stream ConvNet [3] employs RGB images and optical flow stacks as
the inputs of two networks and fuses their predictions by late fusion. Temporal Segment Network
(TSN) [8] improves the performance of two-stream ConvNet by sparsely sampling video frames and
learning video-level predictions. In 2017, DeepMind released a large-scale video action dataset,
Kinetics [25], and proposed the Inflated 3D ConvNet (I3D).

Few-shot learning based methods [26] for action recognition, 
which aim to reduce the need for large amounts of annotated data
in the existing deep learning methods, have also developed rapidly. 
In traditional methods, the training and testing sets involve the 
same classes of samples. In a few-shot recognition setting, the net- 
work needs to effectively learn classifiers for novel concepts from 
only a few examples. Unlike traditional models trained on many 
data samples, the model in a few-shot setting is trained to gener- 
alize across different episodes. 

Numerous methods for temporal context modeling in videos are 
also proposed, such as [27,28]. In this paper, the context informa- 
tion is obtained by the temporal convolution on the short clips, 
typically, less than 5 s. Note that this paper aims at exploring the 
information among skeleton joints and limbs (spatial) and the tem- 
poral context modeling is not the focus. 


2.3. Relation reasoning 


A simple plug-and-play module, Relation Network (RN) [29], was proposed to equip CNN with relation
reasoning ability in several tasks. Recurrent relational networks [30] increased its ability of
solving the tasks that require an order of magnitude more steps of relational reasoning. Non-local
neural network [31] was designed to equip CNN with the ability of long range relation reasoning,
both spatially in images and spatial-temporally in videos. Temporal relation reasoning neural
network (TRN) [32] was designed to learn and reason about the temporal dependencies of video
frames at multiple time scales. The relations in these works were used to represent the underlying
object interactions in the spatial domain [29,30], temporal domain [32], or spatial-temporal
domain [31]. In this paper, the relations of the skeleton joints are realized by 1 × 2 convolution
in the spatial domain and the interactions in the temporal domain are realized by the temporal
convolutions.

Fig. 2. The basic building block of the convolutional relation module for feature extraction. It
mainly consists of dilation group-specific 1 × 2 conv, Temporal Convolution (TCN) and attention
models.


3. Convolutional relation network 


Fig. 2 shows the basic module of our conv-relation network. The module consists of 3 layers: a
dilation group-specific convolution with a 1 × 2 kernel, a temporal convolution with a 9 × 1
kernel, and attention pooling. A residual connection is added for each block. The dilation
group-specific convolution module is designed to extract the relations of unit pairs (i.e., nodes
or edges on the graphs). With this module, the local feature aggregation of convolution is
replaced by global feature aggregation. The coordinates of different skeleton joints are allowed
to explicitly interact with each other from the low layers of the network. The number of joints
captured by sensors or estimated by algorithms for skeleton-based action recognition is less than
30, e.g., 25 for NTU RGB+D and 18 for Kinetics, thus enumerating all the pair relations is
affordable.
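To make the affordability argument concrete, the following minimal sketch counts the joint pairs that are enumerated for the two datasets. It assumes that group i pairs joint j with joint (j + i) mod V for dilations i = 1, ..., V − 1, as described in Section 3.1; the helper name `enumerate_pairs` is hypothetical and not part of the original work.

```python
def enumerate_pairs(num_joints):
    """Return the (j, (j + i) % V) pairs visited by dilations i = 1..V-1."""
    pairs = []
    for dilation in range(1, num_joints):       # one convolution group per dilation
        for j in range(num_joints):             # slide over all joints
            pairs.append((j, (j + dilation) % num_joints))
    return pairs


for name, v in [("NTU RGB+D", 25), ("Kinetics", 18)]:
    pairs = enumerate_pairs(v)
    # Every ordered pair of distinct joints is visited exactly once.
    assert len(set(pairs)) == len(pairs) == v * (v - 1)
    print(name, "groups:", v - 1, "ordered pairs:", len(pairs))
```

Under this assumption, 25 joints give 24 groups and 600 ordered pairs, and 18 joints give 17 groups and 306 ordered pairs, which stays small enough to enumerate densely.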

The conv-relation network is built by stacking the blocks shown in Fig. 2. As shown in Table 1,
there is one data batch normalization layer followed by 9 convolutional relation modules. After
that, global average pooling is performed and the final output is sent to a softmax classifier.


3.1. Dilation group-specific 1 x 2 convolution 


As shown in Fig. 3, several groups of convolutions are used to enumerate all the unit pairs and
get their relations. Different dilations are set in the different groups of convolutions, from 1
to R respectively. For the i-th group of convolutions, its dilation is set as i, where
i = 1, 2, ..., R. That is, each group of convolutions has the same dilation as its group order.
R denotes the number of groups of convolutions. When the inputs are nodes and edges on the
skeleton graphs, the total number of groups of convolution R equals (V − 1) and (V − 2)
respectively, where V is the number of skeleton joints.

Table 1
The structure of the conv-relation network. The number of input channels, output channels and the
stride of each block are shown in the table. Data BN represents the data batch normalization
layer. GAP represents the global average pooling layer.

            Input channels    Output channels    Stride
Data BN     3                 3                  -
L1          3                 72                 1
L2          72                72                 1
L3          72                72                 1
L4          72                144                2
L5          144               144                1
L6          144               144                1
L7          144               288                2
L8          288               288                1
L9          288               288                1
GAP         -                 -                  -
softmax     -                 -                  -

Given an input tensor with the shape of N × C × T × V, it will be convolved by (V − 1) groups of
1 × 2 convolution for (V − 1) times. Normally, the convolution output corresponding to the $j$-th
joint and the $i$-th group can be represented as

$$\mathrm{output}_{ij} = \mathrm{conv}_i\big(j,\ (j + i)\,\%\,N\big) \qquad (1)$$

where $\mathrm{conv}_i(x, y)$ denotes performing 1 × 2 convolution on the $x$-th and $y$-th
spatial dimensions of the input tensor with the $i$-th group convolution, and the result is then
assigned to $\mathrm{output}_{ij}$. The modular operation in Eq. (1) is implemented by padding the
$i$-th spatial dimension of the input tensor to the end of the original input tensor incrementally
before performing the $i$-th group convolution. When the $i$-th group convolution is finished, its
final result is

$$\mathrm{output}_i = \mathrm{concat}\big(\mathrm{conv}_i(j,\ (j + i)\,\%\,N)\big), \quad j = 1, 2, \ldots, V \qquad (2)$$

which is the convolution operation sliding on the spatial dimension of the input tensor. After
finishing each group of 1 × 2 convolution, all outputs are concatenated or summed along the
channel dimension.

$$\mathrm{output} = f(\mathrm{output}_i), \quad i = 1, 2, \ldots, V \qquad (3)$$


where f denotes concatenating or summing the list of input tensors along the channel dimension.
These two fusion methods have the same output dimension in our implementation. In the case of
concatenation fusion, each group of dilation group-specific convolutions has M/D output channels,
where M is the predefined number of output channels and D is the number of groups, so the total
number of output channels after concatenation is (M/D) × D = M. In the case of summation fusion,
each group of convolutions has M output channels, so the total number of output channels after
summation is also M. Typical values of M and D are 72 and 24 respectively, as shown in Table 1,
layers L1, L2 and L3.

Note that the above operations are performed on the V (spa- 
tial) dimension of the input tensor and shared on its T (temporal) 
dimension. Meanwhile, the temporal convolution network (TCN) 
with 9 x 1 kernel is applied to the T dimension of the input tensor 
and shared on its V dimension. 
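The spatial operation described in this section can be sketched in NumPy as follows. This is an illustrative reading of Eqs. (1)-(3), not the authors' implementation: the function name `dilation_group_conv` and the weight layout are hypothetical, biases are omitted, the modular index is taken over the V spatial positions, and the circular padding is realized with `np.roll`.

```python
import numpy as np


def dilation_group_conv(x, weights, fuse="sum"):
    """x: (N, C, T, V); weights: list of R arrays, each of shape (C_out, C, 2)."""
    outputs = []
    for i, w in enumerate(weights, start=1):      # group i uses dilation i
        # partner[..., j] == x[..., (j + i) % V], i.e. circular padding
        partner = np.roll(x, -i, axis=-1)
        # 1 x 2 kernel: first tap on joint j, second tap on joint (j + i) % V;
        # the same weights are shared over the T dimension.
        out = (np.einsum("oc,nctv->notv", w[:, :, 0], x)
               + np.einsum("oc,nctv->notv", w[:, :, 1], partner))
        outputs.append(out)
    if fuse == "concat":
        return np.concatenate(outputs, axis=1)    # R * (M / R) = M channels
    return np.sum(outputs, axis=0)                # M channels


rng = np.random.default_rng(0)
N, C, T, V, M = 2, 3, 8, 25, 72
R = V - 1                                         # 24 groups for 25 joints
x = rng.standard_normal((N, C, T, V))
w_sum = [rng.standard_normal((M, C, 2)) for _ in range(R)]
w_cat = [rng.standard_normal((M // R, C, 2)) for _ in range(R)]
y_sum = dilation_group_conv(x, w_sum, fuse="sum")
y_cat = dilation_group_conv(x, w_cat, fuse="concat")
print(y_sum.shape, y_cat.shape)                   # (2, 72, 8, 25) (2, 72, 8, 25)
```

With M = 72 and 24 groups, each group contributes 3 channels under concatenation and 72 channels under summation, so both fusions end with 72 output channels.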


3.2. Attention pooling 


Enumerating all the pair relations is likely to cause over-fitting due to the sensor noise of
skeleton extraction. Furthermore, not all nodes and frames are needed to discriminate an action.
Thus, three attention models are used to deal with such issues. An SE-block [33] is used to
realize the attention module. The SE-block is a bottleneck with 2 small fully connected layers
whose final activation is sigmoid. It was originally used to adaptively recalibrate channel-wise
feature responses. Here, it is employed to gate the channel, spatial and temporal dimensions of an
input tensor with the shape of C × T × V. The three attention models are shown in Fig. 4, where C
denotes the channel dimension, T denotes the temporal dimension and V denotes the spatial
dimension.

Fig. 3. An illustration of dilation group-specific 1 × 2 convolution. In this example, there are
25 joints. Hence there are 24 groups of 1 × 2 convolution and each of them has dilation the same
as its group order. The convolution outputs of all groups are concatenated or summed along the
channel dimension to get the final output.

Fig. 4. Three attention modules: temporal attention, spatial attention and channel attention.

The operations of the attention models consist of four steps: reshaping the input tensor, global
average pooling, forward passing through two fully connected layers, and channel-wise scaling.

Basic parameters of these operations are listed in Table 2. Taking the temporal attention as an
example, the input tensor is first reshaped to N × T × (V × C). Then global average pooling is
performed on the dimension of (V × C). The obtained tensor has the shape of N × T. The final
gating output also has the shape of N × T, which means that the temporal attention is sample
(video) specific. A similar analysis can be applied to the spatial attention and channel
attention.

Finally, the parallel multiplication is used to fuse these three at- 
tention models. As shown in Fig. 5, all attention weights are mul- 
tiplied with the input tensor in a parallel fashion. 
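The three gates and their parallel fusion can be sketched as below. This is a hedged illustration rather than the original code: the ReLU between the two fully connected layers, the reduction ratio `r`, the absence of biases, and the reading of "parallel multiplication" as an element-wise product of the three broadcast gates with the input are all assumptions, and the names `se_gate` and `attention_pooling` are hypothetical.

```python
import numpy as np


def se_gate(z, w1, w2):
    """Two small fully connected layers: ReLU in between (assumed), sigmoid at the end."""
    hidden = np.maximum(z @ w1, 0.0)
    return 1.0 / (1.0 + np.exp(-(hidden @ w2)))


def attention_pooling(x, params):
    """x: (N, C, T, V); params maps 'temporal'/'spatial'/'channel' to (w1, w2)."""
    n, c, t, v = x.shape
    xt = x.transpose(0, 2, 3, 1)                               # (N, T, V, C)
    # Temporal: reshape to N x T x (V x C), pool over (V x C) -> N x T
    g_t = se_gate(xt.reshape(n, t, v * c).mean(axis=2), *params["temporal"])
    # Spatial: reshape to (N x T) x V x C, pool over C -> (N x T) x V
    g_s = se_gate(xt.reshape(n * t, v, c).mean(axis=2), *params["spatial"])
    # Channel: reshape to (N x T x V) x C, without pooling
    g_c = se_gate(xt.reshape(n * t * v, c), *params["channel"])
    gates = (g_t.reshape(n, t, 1, 1)
             * g_s.reshape(n, t, v, 1)
             * g_c.reshape(n, t, v, c))                        # (N, T, V, C)
    return x * gates.transpose(0, 3, 1, 2)                     # back to (N, C, T, V)


rng = np.random.default_rng(0)
N, C, T, V, r = 2, 72, 8, 25, 4                                # r: assumed reduction ratio


def init(d):
    """Random bottleneck weights d -> d // r -> d (illustrative only)."""
    return (rng.standard_normal((d, d // r)) * 0.1,
            rng.standard_normal((d // r, d)) * 0.1)


params = {"temporal": init(T), "spatial": init(V), "channel": init(C)}
x = rng.standard_normal((N, C, T, V))
y = attention_pooling(x, params)
print(y.shape)                                                 # (2, 72, 8, 25)
```

The gate shapes follow Table 2: N × T for the temporal gate, (N × T) × V for the spatial gate and (N × T × V) × C for the channel gate.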


3.3. Input modalities 


Table 2
Basic parameters of three attention models.

Attention    Reshape              Global average pooling dimension
Temporal     N × T × (V × C)      (V × C)
Spatial      (N × T) × V × C      C
Channel      (N × T × V) × C      Without pooling

Fig. 5. Parallel attention fusion. All the attention weights are multiplied with the input tensor
channel-wise.

The main purpose of dilation group-specific convolution is to obtain all the joint interactions.
However, the input information only relies on the single joints of skeletons. The natural
connections on the skeleton graphs are neglected, which may be important for capturing bone
interactions. Therefore, we also use the graph edges as the input of the conv-relation network.
Formally, each edge of the graph is the concatenation of its two nodes along the channel
dimension. Then the edges are taken as the new spatial dimension of the input tensor with the
shape of N × 2C × T × (V − 1). Inspired by the optical flow stream in the two-stream networks [3],
the frame differences are also considered as the input of the conv-relation network to encode the
explicit dynamics of skeletons. The frame differences are applied to both the nodes and the edges.
That is, four kinds of inputs of the skeleton graphs, namely nodes, edges, node differences and
edge differences, are taken as the inputs of the conv-relation network. The final scores of the
four streams before softmax are averaged to get the final prediction.
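The construction of the four input streams and their late fusion can be sketched as follows. The helper names are hypothetical, the toy edge list is not the NTU RGB+D or Kinetics skeleton, and zero-padding the first frame of the difference stream so that T is preserved is an assumption.

```python
import numpy as np


def edge_stream(nodes, edges):
    """nodes: (N, C, T, V); edges: list of (a, b) joint indices -> (N, 2C, T, E)."""
    a = [e[0] for e in edges]
    b = [e[1] for e in edges]
    # Each edge is the concatenation of its two nodes along the channel dimension.
    return np.concatenate([nodes[..., a], nodes[..., b]], axis=1)


def frame_diff(x):
    """Difference of consecutive frames; the first frame is left at zero (assumed)."""
    diff = np.zeros_like(x)
    diff[:, :, 1:] = x[:, :, 1:] - x[:, :, :-1]
    return diff


def late_fusion(scores):
    """Average the pre-softmax class scores of the streams."""
    return np.mean(scores, axis=0)


rng = np.random.default_rng(0)
nodes = rng.standard_normal((2, 3, 8, 5))            # toy skeleton with V = 5 joints
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]         # V - 1 = 4 edges (illustrative)
edges = edge_stream(nodes, toy_edges)
streams = [nodes, edges, frame_diff(nodes), frame_diff(edges)]
print([s.shape for s in streams])

# Placeholder scores standing in for the outputs of four trained networks.
scores = [rng.standard_normal((2, 60)) for _ in streams]
prediction = late_fusion(scores).argmax(axis=1)
```

The edge tensor has shape N × 2C × T × (V − 1), matching the description above, and the fused prediction is the arg-max of the averaged scores.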


4. Experiments 
4.1. Datasets 


NTU RGB+D [16]. This Kinect captured dataset is currently the 
largest dataset with RGB+D videos and skeleton data for human 
action recognition, with 56,880 video samples. It contains 60 dif- 
ferent action classes including daily, mutual, and health-related 
actions. Each subject has 25 joints. The various setups of cam- 
eras, capturing views, and different facing orientations of the sub- 
jects, result in a great diversity of sample viewpoints. NTU con- 
tains two standard evaluations: Cross-Subject (CS) and Cross-View 
(CV). Specifically, Cross-Subject (CS) consists of 40 subjects that are 
randomly split into the training and the testing groups, while the 
training and the testing group of Cross-View (CV) come from the 
samples of cameras 2, 3 and camera 1 respectively. It is a challenging dataset for action
recognition because of the large number of videos, the variety of subjects, and the differences
between camera views. The top-1 recognition accuracy on both benchmarks is reported.

Table 2 (continued)

Attention    Output shape         Characteristics
Temporal     N × T                sample specific
Spatial      (N × T) × V          sample and temporal specific
Channel      (N × T × V) × C      sample, temporal and spatial specific

Kinetics [25]. Kinetics is a large-scale human action dataset which contains 300,000 video clips
in 400 classes. The video clips are sourced from YouTube videos. It only provides raw video clips
without skeleton data. The authors of [21] estimate the location of 18 joints on every frame of
the clips with the publicly available OpenPose toolbox [34]. Two persons are selected for
multi-person clips based on the average joint confidence. Our model is evaluated on their released
data. The dataset is divided into a training set (240,000 clips) and a validation set (20,000
clips). The top-1 and top-5 accuracies on the validation set are reported.
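For reference, top-1 and top-5 accuracy can be computed from a matrix of class scores as in the following minimal sketch. The function name `topk_accuracy` and the toy scores are illustrative only and do not come from the experiments.

```python
import numpy as np


def topk_accuracy(scores, labels, k=1):
    """scores: (num_samples, num_classes); labels: (num_samples,)."""
    topk = np.argsort(-scores, axis=1)[:, :k]        # the k highest-scoring classes
    hits = (topk == labels[:, None]).any(axis=1)     # is the label among them?
    return hits.mean()


toy_scores = np.array([[0.1, 0.7, 0.2],
                       [0.5, 0.3, 0.2],
                       [0.2, 0.3, 0.5]])
toy_labels = np.array([1, 1, 0])
print(topk_accuracy(toy_scores, toy_labels, k=1))    # 1 of 3 samples correct
print(topk_accuracy(toy_scores, toy_labels, k=2))    # 2 of 3 samples correct
```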


4.2. Experimental details 


To make the model insensitive to the initial position of an action, the skeletons of each sequence
are normalized by subtracting the central joint of the first frame. The central joint is the
average of the 3D coordinates of the hip center, hip left and hip right. For NTU RGB+D, the
average sequence length is about 80 and each sequence is divided into 32 segments. The number of
segments is chosen by evaluating the model performance on a small validation set split from the
training set. During training, we randomly sample a position from the index range of each segment
and apply bilinear interpolation. During testing, the center frame of each segment is sampled. For
Kinetics, the average sequence length is about 250. During training, a subsequence with a length
of 150 is randomly cropped from the sequence. During testing, at most 2 subsequences are cropped
from a sequence, since a sequence has a length of less than 300. The same bilinear interpolation
is applied for Kinetics. The batch size is set to 64 and 4 GPUs are used for data parallelism. The
models are trained using stochastic gradient descent with an initial learning rate of 0.1. For NTU
RGB+D, the learning rate is decayed by a factor of 0.1 according to a schedule of [15, 60] and the
total number of epochs is 90. For Kinetics, the base learning rate is set to 0.2 and is decayed by
a factor of 0.1 according to a schedule of [20, 50]; the total number of epochs is 60. All
experiments are conducted on PyTorch with 4 TITAN X GPUs. Code will be released.
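The normalization and segment sampling steps can be sketched as below. Several details are assumptions made for illustration: the hip joint indices passed as `hip_ids` are hypothetical, the segments are taken as equal-width index ranges, and the interpolation is implemented as linear interpolation between the two neighbouring frames along the time axis.

```python
import numpy as np


def normalize(seq, hip_ids):
    """seq: (T, V, 3). Subtract the central joint (mean of the hip joints) of frame 0."""
    center = seq[0, hip_ids].mean(axis=0)
    return seq - center


def sample_segments(seq, num_segments=32, train=True, rng=None):
    """Pick one (possibly fractional) frame position per segment and interpolate."""
    t = seq.shape[0]
    bounds = np.linspace(0, t - 1, num_segments + 1)      # segment boundaries
    lo, hi = bounds[:-1], bounds[1:]
    if train:
        if rng is None:
            rng = np.random.default_rng()
        pos = rng.uniform(lo, hi)                         # random position per segment
    else:
        pos = (lo + hi) / 2.0                             # segment centre at test time
    left = np.floor(pos).astype(int)
    right = np.minimum(left + 1, t - 1)
    frac = (pos - left)[:, None, None]
    return (1.0 - frac) * seq[left] + frac * seq[right]


rng = np.random.default_rng(0)
sequence = rng.standard_normal((80, 25, 3))               # about 80 frames, 25 joints
sequence = normalize(sequence, hip_ids=[0, 12, 16])       # hypothetical hip indices
clip_train = sample_segments(sequence, 32, train=True, rng=rng)
clip_test = sample_segments(sequence, 32, train=False)
print(clip_train.shape, clip_test.shape)                  # (32, 25, 3) (32, 25, 3)
```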

The random view data augmentation is employed at the sequence level. Specifically, the skeleton is
rotated around the X, Y and Z axes by angles (in degrees) which are generated randomly from −10 to
10. The probability of doing random view augmentation is 50%.
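A minimal sketch of this augmentation is given below. The order in which the three axis rotations are composed and the use of independent uniform angles per axis are assumptions; the function names are hypothetical.

```python
import numpy as np


def rotation_matrix(ax, ay, az):
    """Rotation about X, then Y, then Z; angles in radians (composition order assumed)."""
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return rz @ ry @ rx


def random_view(seq, rng, max_deg=10.0, prob=0.5):
    """seq: (T, V, 3). Rotate the whole sequence with probability `prob`."""
    if rng.random() >= prob:
        return seq
    angles = np.deg2rad(rng.uniform(-max_deg, max_deg, size=3))
    # One rotation is shared by all frames, i.e. augmentation at the sequence level.
    return seq @ rotation_matrix(*angles).T


rng = np.random.default_rng(0)
clip = rng.standard_normal((32, 25, 3))
augmented = random_view(clip, rng)
print(augmented.shape)                                    # (32, 25, 3)
```

Because a rotation preserves distances, bone lengths are unchanged by this augmentation; only the viewpoint varies.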


4.3. Ablation study


In this section, we verify the effectiveness of the proposed com- 
ponents in convolution relation network by three experiments on 
the cross view of NTU RGB+D dataset. 


4.3.1. Convolutional relation module and attention pooling

Firstly, the necessity of the convolutional relation module on the skeleton graphs is evaluated,
as shown in Table 3. A baseline network is designed for comparison where all node interactions are
replaced by 1 × 1 convolution, i.e., without joint interactions. For the conv-relation network,
concatenation is adopted for relation channel fusion, shown as Concat relation. Concat relation
outperforms Conv1 × 1 by more than 3% in accuracy (91.1% vs. 87.3%). This demonstrates that
modeling joint interactions is important for skeleton-based action recognition. In addition, each
attention model consistently improves the performance of Concat relation, and fusing all three in
parallel leads to better performance still. The relation channel fusion using summation achieves
performance comparable to Concat relation.
Table 3
Recognition accuracy of different configurations of the conv-relation module on the NTU RGB+D
(cross view) dataset.

Configuration                                     Top1
Conv1 × 1                                         87.3%
Concat relation                                   91.1%
Concat relation, only with temporal attention     91.4%
Concat relation, only with channel attention      91.3%
Concat relation, only with spatial attention      91.5%
Concat relation, with all attention               91.8%
Sum relation, with attention                      91.9%


Table 4
Recognition accuracy of the conv-relation network with different inputs on the NTU RGB+D (cross
view) dataset.

Input           Top1
Node            91.9%
Edge            91.9%
Node diff       92.1%
Edge diff       92.2%
Four Streams    94.5%

Table 5
Performance on the NTU RGB+D dataset. X-Sub and X-View are the cross subject split and cross view
split of NTU RGB+D respectively.

Method                                 X-Sub    X-View
Lie Group [15]                         50.1%    52.8%
H-RNN [35]                             59.1%    64.0%
Deep LSTM [16]                         60.7%    67.3%
PA-LSTM [16]                           62.9%    70.3%
ST-LSTM+TS [17]                        69.2%    77.7%
Temporal Conv [36]                     74.3%    83.1%
C-CNN+MTLN [20]                        79.6%    84.8%
ST-GCN [21]                            81.5%    88.3%
Conv-Relation, node (ours)             82.5%    91.9%
Conv-Relation, edge (ours)             83.5%    91.9%
Conv-Relation, node, diff (ours)       82.3%    92.1%
Conv-Relation, edge, diff (ours)       84.0%    92.2%
Conv-Relation, Four Streams (ours)     86.2%    94.5%


By default, Sum relation with three attentions is used in the fol- 
lowing experiments, unless otherwise specified. 


4.3.2. Fusion of four streams 

Another important improvement comes from the use of different inputs, as shown in Table 4. The
performances of using each kind of input data alone are compared, shown as Node and Edge, along
with their corresponding frame differences, shown as Node diff and Edge diff respectively. The
performance of combining the four streams is shown as Four Streams. Combining the four kinds of
input data outperforms every single-stream model (94.5% vs. at most 92.2%), which verifies the
importance of natural edge connection and frame difference information for skeleton-based action
recognition.


4.4. Comparison with state-of-the-arts 


As shown in Table 5, on the NTU RGB+D dataset, our model is compared with Lie Group [15],
Hierarchical RNN (H-RNN) [35], Deep LSTM [16], Part-Aware LSTM (PA-LSTM) [16], Spatial Temporal
LSTM with Trust Gates (ST-LSTM+TS) [17], Temporal Convolutional Neural Networks (Temporal Conv)
[36], Clips CNN + Multi-task learning (C-CNN+MTLN) [20] and ST-GCN [21]. Our conv-relation network
outperforms previous state-of-the-art approaches on this dataset, both with a single input
modality and with all four kinds of inputs. Specifically, the conv-relation network with
four-stream late fusion outperforms ST-GCN by more than 4% and 6% in top-1 accuracy on cross
subject and cross view respectively.

Table 6
Performance on the Kinetics dataset.

Method                                 Top1     Top5
Feature Encoding [22]                  14.9%    25.8%
Deep LSTM [16]                         16.4%    35.3%
Temporal Conv [36]                     20.3%    40.0%
ST-GCN [21]                            30.7%    52.8%
Conv-Relation, node (ours)             30.7%    52.99%
Conv-Relation, edge (ours)             31.5%    54.05%
Conv-Relation, node, diff (ours)       27.1%    49.44%
Conv-Relation, edge, diff (ours)       27.8%    50.1%
Conv-Relation, Four Streams (ours)     33.1%    55.8%

Fig. 6. Confusion matrix on Kinetics.

On Kinetics, we compare with four characteristic approaches for skeleton-based action recognition.
The first is the feature encoding approach on hand-crafted features [22], referred to as Feature
Encoding in Table 6. The second is based on LSTM, i.e., Deep LSTM [16]. The third one is based on
CNN, i.e., Temporal ConvNet [36]. The last one is based on GCN, i.e., ST-GCN [21]. In Table 6, the
conv-relation network with the node or edge stream alone matches or outperforms the previous
representative approaches, while the two frame-difference streams alone fall below ST-GCN.
Specifically, the conv-relation network with the late fusion of four streams outperforms ST-GCN by
more than 2% in both top-1 and top-5 accuracy.


4.5. Confusion analysis on Kinetics 


Because the accuracy of our model on Kinetics is low, we show the confusion matrix of our model
(four streams) in Fig. 6. For better illustration, the 10 classes with the highest error rates and
the 10 best classified classes are listed in Table 7 and Table 8 respectively. It can be observed
from Tables 7 and 8 that our model performs poorly on classes that depend heavily on scenes (e.g.,
sled_dog_racing, cleaning_pool) or objects (e.g., counting_money, juggling_balls,
playing_monopoly), and much better on classes that only involve human-centric actions (e.g.,
somersaulting, drop_kicking, jogging). This is because skeleton-based methods only use
person-centric pose for action recognition. In these cases, they are complementary to video-based
methods.

Table 7
Top-10 classes with the highest error rates on the Kinetics dataset. Error rate, %. The numbers
inside the brackets of the third column are the error rates of GT classes being mis-classified.

GT classes               Error rate    Mis-classified most
sled_dog_racing          100.0         washing_hair(8.0)
cleaning_pool            100.0         flipping_pancake(10.0)
counting_money           100.0         squat(10.0)
juggling_balls           100.0         baking_cookies(8.3)
headbutting              100.0         catching_or_throwing_baseball(6.0)
snatch_weight_lifting    100.0         long_jump(6.0)
bobsledding              100.0         squat(8.0)
playing_monopoly         100.0         front_raises(12.5)
tying_bow_tie            100.0         making_snowman(26.0)
throwing_ball            100.0         catching_or_throwing_baseball(10.0)

Table 8
Top-10 correctly classified classes on the Kinetics dataset. Top1 accuracy, %.

GT classes           Top1
somersaulting        94.0
playing_cymbals      92.0
jumping_into_pool    86.0
eating_carrots       83.6
knitting             83.6
playing_poker        83.3
drop_kicking         83.3
jogging              82.0
busking              80.0
pushing_car          79.5


4.6. Timing 


Conv-relation network runs at ~ 12 ms (~80 fps) per clip (32 
frames) on an Nvidia TitanX GPU. It is also fast to train. Training 
on the cross-view of NTU RGB+D takes 15 h in our 4-GPU imple- 
mentation. Training on the Kinetics takes about 30 h. 


4.7. Visualization 


For a detailed comparison, we further investigate the per-category change in accuracy. Fig. 7
shows the results, where the categories are sorted by accuracy gain. It is worth noting that the
performance of most actions has been improved, especially for those involving joint interactions.
For example, over 10% absolute improvement is observed for touch head, pointing to something with
finger, and making a phone call. Fig. 8 shows some example clips for which our model predictions
are correct while the baseline model (conv1x1) fails. For the actions without obvious joint
interaction, such as shake head and pat on back of other person, the relation messages of joints
are not critical.
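The per-category comparison can be reproduced from predicted labels as in the following sketch. The function names and the toy predictions are illustrative, and the 1% threshold mirrors the filtering used for Fig. 7.

```python
import numpy as np


def per_class_accuracy(pred, labels, num_classes):
    """Accuracy of each class; NaN for classes without samples."""
    acc = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            acc[c] = (pred[mask] == c).mean()
    return acc


def accuracy_gain(pred_model, pred_baseline, labels, num_classes, min_change=0.01):
    """Classes sorted by accuracy gain, keeping only changes larger than min_change."""
    gain = (per_class_accuracy(pred_model, labels, num_classes)
            - per_class_accuracy(pred_baseline, labels, num_classes))
    order = np.argsort(-gain)                              # largest gain first
    return [(int(c), float(gain[c])) for c in order if abs(gain[c]) > min_change]


toy_labels = np.array([0, 0, 1, 1, 2, 2])
toy_baseline = np.array([0, 1, 1, 1, 0, 2])                # stands in for conv1x1
toy_model = np.array([0, 0, 1, 1, 2, 2])                   # stands in for conv-relation
print(accuracy_gain(toy_model, toy_baseline, toy_labels, num_classes=3))
```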


Fig. 7. Per-category change in accuracy of conv-relation over the conv1x1 baseline on the NTU
RGB+D dataset in the cross-view setting. For clarity, only categories with change greater than 1%
are shown.

Fig. 8. Visualizing some example clips on NTU RGB+D cross-view, for which our model predictions
are true, while the baseline model (conv1x1) fails. The clips of three action classes, i.e., touch
head (headache), pointing to something with finger and making a phone call, are shown from top to
bottom. The wrong predictions of the baseline model are touch neck (neckache), check time and
drop, respectively.


5. Conclusion 


In this paper, a convolutional relation neural network is proposed for skeleton-based action
recognition. A convolution-like relation module called dilation group-specific convolution is
designed for modeling the potential relations of all the pairs of nodes and edges of the skeleton
graphs. Attention pooling is used to enhance the relation module. The final prediction is obtained
by the late fusion of four streams. Extensive experiments are conducted to show the effectiveness
of our model. On two large datasets, Kinetics and NTU RGB+D, our method achieves competitive
performance. Future work would include extending the deformable convolution kernel [37] to
skeleton-based action recognition, which is perhaps a more efficient way of modeling joint
interactions than the conv-relation network. Another direction is combining skeleton data with RGB
images in a multi-task manner, where the fine-grained features of skeleton data and the appearance
features of RGB images can complement each other. Moreover, combining the proposed method with
data selection methods such as active learning [38], focal loss [39] and online hard negative
mining [40] is also a possible direction.